
    Towards Data-Driven Autonomics in Data Centers

    Continued reliance on human operators for managing data centers is a major impediment to their ever reaching extreme dimensions. Large computer systems in general, and data centers in particular, will ultimately be managed using predictive computational and executable models obtained through data-science tools, and at that point, the intervention of humans will be limited to setting high-level goals and policies rather than performing low-level operations. Data-driven autonomics, where management and control are based on holistic predictive models that are built and updated using generated data, opens one possible path towards limiting the role of operators in data centers. In this paper, we present a data-science study of a public Google dataset collected in a 12K-node cluster with the goal of building and evaluating a predictive model for node failures. We use BigQuery, the big data SQL platform from the Google Cloud suite, to process massive amounts of data and generate a rich feature set characterizing machine state over time. We describe how an ensemble classifier can be built out of many Random Forest classifiers, each trained on these features, to predict if machines will fail in a future 24-hour window. Our evaluation reveals that if we limit false positive rates to 5%, we can achieve true positive rates between 27% and 88% with precision varying between 50% and 72%. We discuss the practicality of including our predictive model as the central component of a data-driven autonomic manager and operating it on-line with live data streams (rather than off-line on data logs). All of the scripts used for BigQuery and classification analyses are publicly available from the authors' website. Comment: 12 pages, 6 figures
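As a minimal sketch of the evaluation described in this abstract, the following Python shows how a true positive rate and a precision can be read off once the false positive rate is capped (here at 5%). The function names, the plain averaging used to combine the individual classifiers, and any scores passed in are assumptions made for illustration; the abstract does not say how the Random Forest outputs are combined, and this is not the authors' code.

```python
from typing import List, Optional, Sequence, Tuple


def ensemble_scores(per_model_scores: Sequence[Sequence[float]]) -> List[float]:
    """Average the per-machine scores of several classifiers.

    Assumption: simple averaging; the abstract does not state the combination rule.
    """
    n_models = len(per_model_scores)
    return [sum(column) / n_models for column in zip(*per_model_scores)]


def rates_at_fpr(
    scores: Sequence[float], labels: Sequence[int], max_fpr: float = 0.05
) -> Optional[Tuple[float, float, float]]:
    """Return (threshold, true positive rate, precision) for the lowest
    score threshold whose false positive rate stays at or below max_fpr.

    labels: 1 = machine failed in the prediction window, 0 = it did not.
    Returns None if even the highest threshold exceeds max_fpr.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    best = None
    # Try every distinct score as a threshold, from strictest to loosest.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        if fp / negatives > max_fpr:
            break  # a looser threshold can only add more false positives
        best = (threshold, tp / positives, tp / (tp + fp))
    return best
```

Calling `rates_at_fpr(ensemble_scores(model_outputs), labels)` on held-out data would give one operating point of the kind the abstract reports; the spread of values it quotes presumably comes from repeating such an evaluation over different test periods.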

    Towards Operator-less Data Centers Through Data-Driven, Predictive, Proactive Autonomics

    Continued reliance on human operators for managing data centers is a major impediment to their ever reaching extreme dimensions. Large computer systems in general, and data centers in particular, will ultimately be managed using predictive computational and executable models obtained through data-science tools, and at that point, the intervention of humans will be limited to setting high-level goals and policies rather than performing low-level operations. Data-driven autonomics, where management and control are based on holistic predictive models that are built and updated using live data, opens one possible path towards limiting the role of operators in data centers. In this paper, we present a data-science study of a public Google dataset collected in a 12K-node cluster with the goal of building and evaluating predictive models for node failures. Our results support the practicality of a data-driven approach by showing the effectiveness of predictive models based on data found in typical data center logs. We use BigQuery, the big data SQL platform from the Google Cloud suite, to process massive amounts of data and generate a rich feature set characterizing node state over time. We describe how an ensemble classifier can be built out of many Random Forest classifiers, each trained on these features, to predict if nodes will fail in a future 24-hour window. Our evaluation reveals that if we limit false positive rates to 5%, we can achieve true positive rates between 27% and 88% with precision varying between 50% and 72%. This level of performance allows us to recover a large fraction of job executions (by redirecting them to other nodes when a failure of the present node is predicted) that would otherwise have been wasted due to failures. [...

    Gene regulatory network modelling with evolutionary algorithms - an integrative approach

    Building models for gene regulation has been an important aim of Systems Biology over the past years, driven by the large amount of gene expression data that has become available. Models represent regulatory interactions between genes and transcription factors and can provide better understanding of biological processes, and means of simulating both natural and perturbed systems (e.g. those associated with disease). Gene regulatory network (GRN) quantitative modelling is still limited, however, due to data issues such as noise and the restricted length of the time series typically used for GRN reverse engineering. These issues create an under-determination problem, with many models possibly fitting the data. However, large amounts of other types of biological data and knowledge are available, such as cross-platform measurements, knockout experiments, annotations, binding site affinities for transcription factors and so on. It has been postulated that integration of these can improve the quality of the models obtained, by facilitating further filtering of possible models. However, integration is not straightforward, as the different types of data can provide contradictory information and are intrinsically noisy; hence large-scale integration has not been fully explored to date. Here, we present an integrative parallel framework for GRN modelling, which employs evolutionary computation and different types of data to enhance model inference. Integration is performed at different levels. (i) An analysis of cross-platform integration of time series microarray data, discussing the effects on the resulting models and exploring cross-platform normalisation techniques, is presented. This shows that time-course data integration is possible, and results in models more robust to noise and parameter perturbation, as well as reduced noise over-fitting.
    (ii) Other types of measurements and knowledge, such as knock-out experiments, annotated transcription factors, binding site affinities and promoter sequences are integrated within the evolutionary framework to obtain more plausible GRN models. This is performed by customising initialisation, mutation and evaluation of candidate model solutions. The different data types are investigated and both qualitative and quantitative improvements are obtained. Results suggest that caution is needed in order to obtain improved models from combined data, and the case study presented here provides an example of how this can be achieved. Furthermore, (iii), RNA-seq data is studied in comparison to microarray experiments, to identify overlapping features and possibilities of integration within the framework. The extension of the framework to this data type is straightforward and qualitative improvements are obtained when combining predicted interactions from single-channel and RNA-seq datasets.

    Egalitarianism in the rank aggregation problem: a new dimension for democracy

    Winner selection by majority, in an election between two candidates, is the only rule compatible with democratic principles. Instead, when the candidates are three or more and the voters rank candidates in order of preference, there are no univocal criteria for the selection of the winning (consensus) ranking and the outcome is known to depend appreciably on the adopted rule. Building upon Condorcet's 18th-century theory, whose idea was to maximize total voter satisfaction, we propose here the addition of a new basic principle (dimension) to guide the selection: satisfaction should be distributed among voters as equally as possible. With this new criterion we identify an optimal set of rankings. They range from the Condorcet solution to the one which is the most egalitarian with respect to the voters. We show that highly egalitarian rankings have the important property of being more stable with respect to fluctuations and that classical consensus rankings (Copeland, Tideman, Schulze) often turn out to be non-optimal. The new dimension we have introduced provides, when used together with that of Condorcet, a clear classification of all the possible rankings. By increasing awareness in selecting a consensus ranking, our method may lead to social choices which are more egalitarian compared to those achieved by presently available voting systems. Comment: 18 pages, 14 page appendix, RateIt Web Tool: http://www.sapienzaapps.it/rateit.php, RankIt Android mobile application: https://play.google.com/store/apps/details?id=sapienza.informatica.rankit. Appears in Quality & Quantity, 10 Apr 2015, Online First
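The two criteria in this abstract (total satisfaction and its equal distribution among voters) can be illustrated with a small brute-force sketch. It assumes that a voter's dissatisfaction with a consensus ranking is the Kendall tau distance to that voter's own ranking and that inequality is measured by the population variance of those distances; both choices, and all function names, are assumptions for illustration rather than the authors' definitions. The rankings that no other ranking beats on both criteria form the candidate "optimal set".

```python
from itertools import combinations, permutations
from statistics import pvariance


def kendall_tau(a, b):
    """Number of candidate pairs that rankings a and b order differently."""
    position_in_b = {candidate: i for i, candidate in enumerate(b)}
    return sum(
        1 for x, y in combinations(a, 2) if position_in_b[x] > position_in_b[y]
    )


def pareto_rankings(votes):
    """Return (ranking, total_distance, variance) for every ranking that is
    not dominated on both total distance and variance of distances.

    Brute force over all permutations, so only suitable for a few candidates.
    """
    candidates = sorted(votes[0])
    scored = []
    for ranking in permutations(candidates):
        distances = [kendall_tau(ranking, vote) for vote in votes]
        scored.append((ranking, sum(distances), pvariance(distances)))

    def dominated(s):
        # s is dominated if some t is no worse on both criteria and
        # strictly better on at least one.
        return any(
            t[1] <= s[1] and t[2] <= s[2] and (t[1] < s[1] or t[2] < s[2])
            for t in scored
        )

    return [s for s in scored if not dominated(s)]
```

Under these assumptions, the entry with the smallest total distance plays the role of the Condorcet-style solution and the entry with the smallest variance is the most egalitarian one; the remaining entries trade one criterion against the other.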

    A Big Data Analyzer for Large Trace Logs

    The current generation of Internet-based services is typically hosted in large data centers that take the form of warehouse-size structures housing tens of thousands of servers. Continued availability of a modern data center is the result of a complex orchestration among many internal and external actors including computing hardware, multiple layers of intricate software, networking and storage devices, electrical power and cooling plants. During the course of their operation, many of these components produce large amounts of data in the form of event and error logs that are essential not only for identifying and resolving problems but also for improving data center efficiency and management. Most of these activities would benefit significantly from data analytics techniques to exploit hidden statistical patterns and correlations that may be present in the data. The sheer volume of data to be analyzed makes uncovering these correlations and patterns a challenging task. This paper presents BiDAl, a prototype Java tool for log-data analysis that incorporates several Big Data technologies in order to simplify the task of extracting information from data traces produced by large clusters and server farms. BiDAl provides the user with several analysis languages (SQL, R and Hadoop MapReduce) and storage backends (HDFS and SQLite) that can be freely mixed and matched so that a custom tool for a specific task can be easily constructed. BiDAl has a modular architecture so that it can be extended with other backends and analysis languages in the future. In this paper we present the design of BiDAl and describe our experience using it to analyze publicly-available traces from Google data clusters, with the goal of building a realistic model of a complex data center. Comment: 26 pages, 10 figures

    Opinion dynamics with disagreement and modulated information

    Opinion dynamics concerns social processes through which populations or groups of individuals agree or disagree on specific issues. As such, modelling opinion dynamics represents an important research area that has been progressively acquiring relevance in many different domains. Existing approaches have mostly represented opinions through discrete binary or continuous variables by exploring a whole panoply of cases: e.g. independence, noise, external effects, multiple issues. In most of these cases the crucial ingredient is an attractive dynamics through which similar or similar enough agents get closer. Only rarely has the possibility of explicit disagreement been taken into account (i.e., the possibility of a repulsive interaction among individuals' opinions), and mostly for discrete or 1-dimensional opinions, through the introduction of additional model parameters. Here we introduce a new model of opinion formation, which focuses on the interplay between the possibility of explicit disagreement, modulated in a self-consistent way by the existing opinions' overlaps between the interacting individuals, and the effect of external information on the system. Opinions are modelled as a vector of continuous variables related to multiple possible choices for an issue. Information can be modulated to account for promoting multiple possible choices. Numerical results show that extreme information results in segregation and has a limited effect on the population, while milder messages have better success and a cohesion effect. Additionally, the initial condition plays an important role, with the population forming one or multiple clusters based on the initial average similarity between individuals, with a transition point depending on the number of opinion choices.

    Data integration for microarrays: enhanced inference for gene regulatory networks

    Microarray technologies have been the basis of numerous important findings regarding gene expression in the last decades. Studies have generated large amounts of data describing various processes, which, due to the existence of public databases, are widely available for further analysis. Given their lower cost and higher maturity compared to newer sequencing technologies, these data continue to be produced, even though data quality has been the subject of some debate. However, given the large volume of data generated, integration can help overcome some issues related e.g. to noise or reduced time resolution, while providing additional insight on features not directly addressed by sequencing methods. Here we present an integration test case based on public Drosophila melanogaster datasets (gene expression, binding site affinities, known interactions). Using an evolutionary computation framework, we show how integration can enhance the ability to recover transcriptional gene regulatory networks from these data, as well as indicating which data types are more important for quantitative and qualitative network inference. Our results show a clear improvement in performance when multiple data sets are integrated, indicating that microarray data will remain a valuable and viable resource for some time to come

    Algorithmic bias amplifies opinion polarization: A bounded confidence model

    The flow of information reaching us via the online media platforms is optimized not by the information content or relevance but by popularity and proximity to the target. This is typically performed in order to maximise platform usage. As a side effect, this introduces an algorithmic bias that is believed to enhance polarization of the societal debate. To study this phenomenon, we modify the well-known continuous opinion dynamics model of bounded confidence in order to account for the algorithmic bias and investigate its consequences. In the simplest version of the original model the pairs of discussion participants are chosen at random and their opinions get closer to each other if they are within a fixed tolerance level. We modify the selection rule of the discussion partners: there is an enhanced probability to choose individuals whose opinions are already close to each other, thus mimicking the behavior of online media which suggest interaction with similar peers. As a result we observe: a) an increased tendency towards polarization, which emerges also in conditions where the original model would predict convergence, and b) a dramatic slowing down of the speed at which the convergence at the asymptotic state is reached, which makes the system highly unstable. Polarization is augmented by a fragmented initial population
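The modified selection rule described in this abstract can be sketched in a few lines of Python. The sketch assumes opinions in [0, 1], a pairwise update in which both participants move a fraction `mu` of the way towards each other when they differ by less than `epsilon`, and a partner-selection weight proportional to the opinion distance raised to the power `-gamma`. The power-law form, the small floor on the distance, and all names and parameter values are assumptions for illustration; the abstract only states that similar individuals are chosen with enhanced probability.

```python
import random


def step(opinions, epsilon, mu, gamma, rng):
    """One interaction of a bounded confidence model with biased partner choice."""
    i = rng.randrange(len(opinions))
    others = [j for j in range(len(opinions)) if j != i]
    # Closer opinions get larger weights; gamma = 0 gives uniform weights,
    # i.e. the random pairing of the original model.
    weights = [
        max(abs(opinions[i] - opinions[j]), 1e-6) ** -gamma for j in others
    ]
    j = rng.choices(others, weights=weights, k=1)[0]
    diff = opinions[j] - opinions[i]
    # Opinions only move if the pair is within the tolerance epsilon.
    if abs(diff) < epsilon:
        opinions[i] += mu * diff
        opinions[j] -= mu * diff


def simulate(n, steps, epsilon, mu, gamma, seed):
    """Run the model from uniform random opinions; return (initial, final)."""
    rng = random.Random(seed)
    initial = [rng.random() for _ in range(n)]
    opinions = list(initial)
    for _ in range(steps):
        step(opinions, epsilon, mu, gamma, rng)
    return initial, opinions
```

Comparing runs with `gamma = 0` against runs with larger `gamma`, for the same `epsilon`, is one way to explore the two effects the abstract reports: the number of surviving opinion clusters and the time needed to reach them.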